18 May 2017

Automatically generated data

With this data we might like to:

  • Look for trends over time
  • Compare different moments in time

Do other time series analysis:

  • Look for seasonality
  • Fit ARIMA models
  • Calculate a moving average

But also other types of analysis that involve processing timestamp data.

However, our data looks like this:

library(padr)
library(dplyr)
padr::emergency %>% head
## # A tibble: 6 x 6
##        lat       lng   zip                   title          time_stamp
##      <dbl>     <dbl> <int>                   <chr>              <dttm>
## 1 40.29788 -75.58129 19525  EMS: BACK PAINS/INJURY 2015-12-10 17:40:00
## 2 40.25806 -75.26468 19446 EMS: DIABETIC EMERGENCY 2015-12-10 17:40:00
## 3 40.12118 -75.35198 19401     Fire: GAS-ODOR/LEAK 2015-12-10 17:40:00
## 4 40.11615 -75.34351 19401  EMS: CARDIAC EMERGENCY 2015-12-10 17:40:01
## 5 40.25149 -75.60335    NA          EMS: DIZZINESS 2015-12-10 17:40:01
## 6 40.25347 -75.28324 19446        EMS: HEAD INJURY 2015-12-10 17:40:01
## # ... with 1 more variables: twp <chr>

Or like this:

coffee
##            time_stamp amount
## 1 2016-07-07 09:11:21   3.14
## 2 2016-07-07 09:46:48   2.98
## 3 2016-07-09 13:25:17   4.11
## 4 2016-07-10 10:45:11   3.14

padr helps out with two challenges

The first challenge: every row is a single observation, typically at the level of seconds, while you want to do the analysis at a (much) higher level.

  • padr offers: thicken. It is used in conjunction with a data manipulation package, like dplyr or data.table.
emergency %>% 
  thicken(interval = "month") %>% 
  count(time_stamp_month) %>% 
  head()
## # A tibble: 6 x 2
##   time_stamp_month     n
##             <date> <int>
## 1       2015-12-01  7969
## 2       2016-01-01 13205
## 3       2016-02-01 11467
## 4       2016-03-01 11101
## 5       2016-04-01 11326
## 6       2016-05-01 11423

The second challenge: when there is no observation, there is no record.

  • padr offers: pad
data.frame(dt = as.Date(c("2017-02-23", "2017-02-26")), 
           val = c(2, 4)) %>% 
  pad(interval = "day")
## pad applied on the interval: day
##           dt val
## 1 2017-02-23   2
## 2 2017-02-24  NA
## 3 2017-02-25  NA
## 4 2017-02-26   4

The interval

Think of time data as having a heartbeat. It produces data at a certain interval.

padr currently uses eight intervals: year, quarter, month, week, day, hour, minute, and second.

get_interval(emergency$time_stamp)
## [1] "sec"

The interval is the highest of the eight that can explain all the instances observed in the data.

dt <- as.Date(c("2017-02-23", "2017-02-26", "2017-02-27"))
all(dt %in% seq(dt %>% min, dt %>% max, by = 'day'))
## [1] TRUE
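The check above can be turned into a small search over candidate intervals. The sketch below uses base R only and is not padr's implementation: guess_interval is a hypothetical helper, it handles Date vectors only, and it anchors every candidate sequence at the earliest date, which is an assumption made to keep the example short.

```r
# Hypothetical helper: try the intervals from highest to lowest and
# return the first one whose sequence contains every observed date.
guess_interval <- function(x) {
  candidates <- c("year", "quarter", "month", "week", "day")
  for (iv in candidates) {
    spanned <- seq(min(x), max(x), by = iv)
    if (all(x %in% spanned)) {
      return(iv)
    }
  }
  NA_character_
}

dt <- as.Date(c("2017-02-23", "2017-02-26", "2017-02-27"))
guess_interval(dt)
## [1] "day"
```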

The interval unit

This week v0.3.0 came out on CRAN. The notion of the interval is widened: it now allows for units other than 1 within each interval.

as.Date(c("2017-05-12", "2017-05-14", "2017-05-18")) %>% 
  get_interval()
## [1] "2 day"
as.POSIXct(c("2017-05-14 09:00:00", "2017-05-14 09:00:05", 
             "2017-05-14 09:00:20")) %>% 
  get_interval()
## [1] "5 sec"
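One way to think about the unit: it is the largest step that still lands on every observation, which for day-level data is the greatest common divisor of the gaps between consecutive dates. The sketch below is base R and only illustrates that idea; gcd and day_unit are hypothetical helpers, not padr functions, and padr may arrive at the unit differently.

```r
# Greatest common divisor of two non-negative integers (Euclid).
gcd <- function(a, b) if (b == 0) a else gcd(b, a %% b)

# Hypothetical helper: unit, in days, of a vector of dates.
day_unit <- function(x) {
  gaps <- as.integer(diff(sort(x)))  # gaps between consecutive dates, in days
  Reduce(gcd, gaps)
}

day_unit(as.Date(c("2017-05-12", "2017-05-14", "2017-05-18")))
## [1] 2
```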

thicken

The thicken function takes a data frame and then does the following:

  • look for the datetime variable in the data frame.
  • assess the interval of this variable.
  • span a variable of a higher interval and unit around it.
  • assign each original observation to a value in the spanned variable.
  • add the assignments to the original data frame.
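The steps above can be imitated by hand in base R. This is a rough sketch of the idea for the day interval, not how padr implements thicken; the coffee data is rebuilt here from the printed output, and the UTC time zone is an assumption.

```r
coffee <- data.frame(
  time_stamp = as.POSIXct(c("2016-07-07 09:11:21", "2016-07-07 09:46:48",
                            "2016-07-09 13:25:17", "2016-07-10 10:45:11"),
                          tz = "UTC"),
  amount = c(3.14, 2.98, 4.11, 3.14)
)

# Span a day-level variable around the range of the timestamps.
first_day  <- as.POSIXct(trunc(min(coffee$time_stamp), "days"))
day_starts <- seq(first_day, max(coffee$time_stamp), by = "day")

# Assign each observation to the last day start that is not after it
# (this corresponds to rounding down).
idx <- findInterval(as.numeric(coffee$time_stamp), as.numeric(day_starts))

# Add the assignments to the original data frame.
coffee$time_stamp_day <- as.Date(day_starts[idx], tz = "UTC")
coffee
```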

thicken example

coffee %>% 
  thicken(interval = "day")
##            time_stamp amount time_stamp_day
## 1 2016-07-07 09:11:21   3.14     2016-07-07
## 2 2016-07-07 09:46:48   2.98     2016-07-07
## 3 2016-07-09 13:25:17   4.11     2016-07-09
## 4 2016-07-10 10:45:11   3.14     2016-07-10

thicken example

coffee %>%
  thicken(interval = "15 min", rounding = "up")
##            time_stamp amount   time_stamp_15_min
## 1 2016-07-07 09:11:21   3.14 2016-07-07 09:15:00
## 2 2016-07-07 09:46:48   2.98 2016-07-07 10:00:00
## 3 2016-07-09 13:25:17   4.11 2016-07-09 13:30:00
## 4 2016-07-10 10:45:11   3.14 2016-07-10 11:00:00
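Rounding up to a 15-minute interval boils down to taking the ceiling of the timestamp in units of 900 seconds. The base R sketch below reproduces the new column from the output above under the assumption that the timestamps are in UTC; it is an illustration only, and padr's treatment of values that fall exactly on a boundary may differ.

```r
time_stamp <- as.POSIXct(c("2016-07-07 09:11:21", "2016-07-07 09:46:48",
                           "2016-07-09 13:25:17", "2016-07-10 10:45:11"),
                         tz = "UTC")

width <- 15 * 60  # interval width in seconds

# Ceiling in units of `width` seconds since the epoch.
rounded_up <- as.POSIXct(ceiling(as.numeric(time_stamp) / width) * width,
                         origin = "1970-01-01", tz = "UTC")

format(rounded_up, "%Y-%m-%d %H:%M:%S")
```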

thicken parameters:

  • x
  • interval
  • colname = NULL
  • rounding = c("down", "up")
  • by = NULL
  • start_val = NULL

pad

The pad function takes a data frame and then does the following:

  • look for the datetime variable in the data frame.
  • assess the interval of this variable.
  • span a variable of the same interval and unit.
  • merge the original variable with the spanned variable.
  • leave NA values for the other variables.
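In base R the steps above come down to spanning a complete sequence and left-joining the original data onto it. The sketch below shows the idea for a day interval; it is not padr's implementation, and the interval is fixed by hand here rather than detected.

```r
df <- data.frame(dt  = as.Date(c("2017-02-23", "2017-02-26")),
                 val = c(2, 4))

# Span a variable of the same interval (day) over the observed range.
spanned <- data.frame(dt = seq(min(df$dt), max(df$dt), by = "day"))

# Merge the original data onto the spanned variable;
# rows without an observation get NA for the other variables.
padded <- merge(spanned, df, by = "dt", all.x = TRUE)
padded
```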

pad example

coffee %>% 
  thicken(interval = "day", colname = "d") %>% 
  count(d) %>% 
  pad()
## pad applied on the interval: day
## # A tibble: 4 x 2
##            d     n
##       <date> <int>
## 1 2016-07-07     2
## 2 2016-07-08    NA
## 3 2016-07-09     1
## 4 2016-07-10     1

pad parameters:

  • x
  • interval = NULL
  • start_val = NULL
  • end_val = NULL
  • by = NULL
  • group = NULL
  • break_above = 1

pad

Padding can thus also be done within a grouping variable.

emergency %>% 
  thicken('month', colname = "m") %>% 
  count(title, m) %>% 
  pad(group = "title", 
      start_val = as.Date("2015-12-01"),
      end_val   = as.Date("2016-10-01"))
## pad applied on the interval: month
## Source: local data frame [1,287 x 3]
## Groups: title [117]
## 
##                   title          m     n
##                   <chr>     <date> <int>
## 1  EMS: ABDOMINAL PAINS 2015-12-01   128
## 2  EMS: ABDOMINAL PAINS 2016-01-01   186
## 3  EMS: ABDOMINAL PAINS 2016-02-01   161
## 4  EMS: ABDOMINAL PAINS 2016-03-01   184
## 5  EMS: ABDOMINAL PAINS 2016-04-01   185
## 6  EMS: ABDOMINAL PAINS 2016-05-01   162
## 7  EMS: ABDOMINAL PAINS 2016-06-01   158
## 8  EMS: ABDOMINAL PAINS 2016-07-01   143
## 9  EMS: ABDOMINAL PAINS 2016-08-01   176
## 10 EMS: ABDOMINAL PAINS 2016-09-01   174
## # ... with 1,277 more rows
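Conceptually, padding within groups means that every group gets every time point between the start and end value. A base R sketch of that idea follows; the counts data frame is a made-up toy example, and this is not how padr implements the group argument.

```r
# Hypothetical monthly counts for two call types, with gaps.
counts <- data.frame(
  title = c("A", "A", "B"),
  m     = as.Date(c("2016-01-01", "2016-03-01", "2016-02-01")),
  n     = c(5L, 7L, 3L),
  stringsAsFactors = FALSE
)

# All months between a shared start and end value.
months <- seq(as.Date("2016-01-01"), as.Date("2016-03-01"), by = "month")

# Every combination of group and month, then a left join of the counts.
grid   <- expand.grid(title = unique(counts$title), m = months,
                      stringsAsFactors = FALSE)
padded <- merge(grid, counts, by = c("title", "m"), all.x = TRUE)
padded <- padded[order(padded$title, padded$m), ]
padded
```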

Fill the missings

After padding you are left with missing values in the inserted records.

padded_df <- data.frame(dt  = as.Date(c("2017-02-23", "2017-02-25", "2017-02-28")),
                        val = c(2, 4, 2)) %>% 
  pad()
## pad applied on the interval: day
padded_df
##           dt val
## 1 2017-02-23   2
## 2 2017-02-24  NA
## 3 2017-02-25   4
## 4 2017-02-26  NA
## 5 2017-02-27  NA
## 6 2017-02-28   2

Depending on the nature of the data you might want to:

Carry the last value forward

padded_df %>% 
  tidyr::fill(val)
##           dt val
## 1 2017-02-23   2
## 2 2017-02-24   2
## 3 2017-02-25   4
## 4 2017-02-26   4
## 5 2017-02-27   4
## 6 2017-02-28   2

Fill all the missings with the same value

padded_df %>% 
  fill_by_value(val, value = 42)
##           dt val
## 1 2017-02-23   2
## 2 2017-02-24  42
## 3 2017-02-25   4
## 4 2017-02-26  42
## 5 2017-02-27  42
## 6 2017-02-28   2

Fill all the missings with a function of the nonmissings

padded_df %>% 
  fill_by_function(val, fun = mean)
##           dt      val
## 1 2017-02-23 2.000000
## 2 2017-02-24 2.666667
## 3 2017-02-25 4.000000
## 4 2017-02-26 2.666667
## 5 2017-02-27 2.666667
## 6 2017-02-28 2.000000

Fill all the missings with the most prevalent of the nonmissings

padded_df %>% 
  fill_by_prevalent(val)
##           dt val
## 1 2017-02-23   2
## 2 2017-02-24   2
## 3 2017-02-25   4
## 4 2017-02-26   2
## 5 2017-02-27   2
## 6 2017-02-28   2
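What fill_by_value, fill_by_function and fill_by_prevalent do to a single column can be written out in base R. The sketch below works on a plain vector with the same values as padded_df$val; it only illustrates the three strategies and is not the code padr uses.

```r
val     <- c(2, NA, 4, NA, NA, 2)
missing <- is.na(val)

# Same value everywhere.
by_value <- ifelse(missing, 42, val)

# A function of the nonmissing values, here the mean.
by_function <- ifelse(missing, mean(val, na.rm = TRUE), val)

# The most prevalent nonmissing value (table() ignores NA by default).
most_prevalent <- as.numeric(names(which.max(table(val))))
by_prevalent   <- ifelse(missing, most_prevalent, val)

data.frame(val, by_value, by_function, by_prevalent)
```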

Final example to wrap up

library(ggplot2)
animal_bites_plot <- 
  emergency %>% 
  filter(title == 'EMS: ANIMAL BITE') %>% 
  thicken(interval = 'day', colname = 'ts_day') %>% 
  count(ts_day) %>% 
  pad() %>% 
  fill_by_value(n) %>% 
  ggplot(aes(ts_day, n)) +
  geom_point() +
  geom_line() +
  geom_smooth()
## pad applied on the interval: day

animal_bites_plot

More information

There are two vignettes: a general introduction, and more details on the implementation.

vignette("padr")
vignette("padr_implementation")

I blog about changes in padr on: thats-so-random.com

And the package is maintained on: github.com/EdwinTh/padr